      1 \section{Supervised Sentential Extraction Models}
      2 \label{sec:relation extraction:sentential}
      3 In the supervised setup, all variables listed in Table~\ref{tab:relation extraction:supervised samples} are given at train time.
      4 During evaluation, the relation must be predicted from the other three variables: sentence, head entity and tail entity.
      5 The predictions for each sample can then be compared to the gold standard.%
      6 \sidenote{
	When a distant supervision dataset is used, ``gold standard'' is somewhat of a misnomer.
	In this case, the relation labels are often referred to as a ``silver standard'' since they are automatically generated and therefore of lower quality than manual annotations.
      9 }
     10 We introduce the commonly used metric for evaluation on a supervised dataset in Section~\ref{sec:relation extraction:supervised evaluation}.
The following sections then present important supervised methods, including weakly-supervised and semi-supervised ones.
     12 These sections focus on sentential relation extraction methods, which realize Equation~\ref{eq:relation extraction:sentential definition}.
     13 In contrast, Section~\ref{sec:relation extraction:aggregate} focuses on aggregate methods, which often build upon sentential approaches.
     14 
     15 \subsection{Evaluation}
     16 \label{sec:relation extraction:supervised evaluation}
     17 Since supervised relation extraction is a standard multiclass classification task, it uses the usual \fone{} metric, with one small tweak to handle directionality.
As during training, we use samples from \(\dataSet_\relationSet\subseteq\sentenceSet\times\entitySet^2\times\relationSet\) for evaluation.
     19 Let's call \(x\in\dataSet\subseteq\sentenceSet\times\entitySet^2\) an unlabeled sample, and \(g\colon\dataSet\to\relationSet\) the function which associates with each sample \(x\) its gold label in the dataset (as given by \(\dataSet_\relationSet\)).
     20 Similarly, let's call \(c\colon\dataSet\to\relationSet\) the function which associates with each sample \(x\) the relation predicted by the model we are evaluating.
     21 The standard \fone{} score for a relation \(r\in\relationSet\) can be defined as:
     22 \begin{equation*}
     23 	\operatorname{precision}(g, c, r) =
     24 		\frac{|\{\,x\in\dataSet \mid c(x) = g(x) = r\,\}|}{|\{\,x\in\dataSet \mid c(x) = r\,\}|} = \frac{\text{true positive}}{\text{predicted positive}}
     25 \end{equation*}
     26 % Page break here
     27 \begin{align*}
     28 	\operatorname{recall}(g, c, r) & =
     29 		\frac{|\{\,x\in\dataSet \mid c(x) = g(x) = r\,\}|}{|\{\,x\in\dataSet \mid g(x) = r\,\}|} = \frac{\text{true positive}}{\text{labeled positive}}\\
     30 	\fone(g, c, r) & =
		\frac{2}{\operatorname{precision}(g, c, r)^{-1}+\operatorname{recall}(g, c, r)^{-1}}.
     32 \end{align*}
     33 
     34 To aggregate these scores into a single number, multiple approaches are possible.
First of all, micro-averaging: the true positive, predicted positive and labeled positive counts are pooled over all relations before computing precision and recall.
     36 In the case where all samples have one and only one label and prediction, micro-precision, micro-recall and micro-\fone{} collapse into the same value, namely the accuracy.
     37 However, when computing a micro-metric on a dataset containing the \textsl{other} relation (Section~\ref{sec:relation extraction:other}), the samples labeled \textsl{other} are ignored, making the difference between micro-precision and micro-recall relevant again.
     38 
The second set of approaches uses macro-averaging, which means that the scores are first computed for each relation separately before being averaged over the set of relations.
This compensates for the class imbalance in the dataset since, in this per-relation average, the score of a rare class is weighted the same as the score of a frequent class.
     41 The ``directed'' macro-scores are defined as usual:
     42 \begin{align*}
     43 	\overDirected{\operatorname{precision}}(g, c) & = \frac{1}{|\relationSet|} \sum_{r\in\relationSet} \operatorname{precision}(g, c, r) \\
     44 	\overDirected{\operatorname{recall}}(g, c) & = \frac{1}{|\relationSet|} \sum_{r\in\relationSet} \operatorname{recall}(g, c, r) \\
     45 	\overDirected{\fone}(g, c) & = \frac{1}{|\relationSet|} \sum_{r\in\relationSet} \fone(g, c, r).
     46 \end{align*}
     47 However, two other variants exist.
     48 These variants try to discard the orientation of the relationship by packing together a relation \(r\) with its reverse \(\breve{r}\).
     49 This allows us to evaluate separately the ability of the model to find the correct relation and to find which entity is the subject (\(e_1\)) and which is the object (\(e_2\)).
The simplest way to achieve this is to ignore the orientation altogether:
     51 \begin{equation*}
     52 	\overUndirected{\operatorname{precision}}(g, c) = 
     53 		\frac{1}{|\relationSet^\dagger|} \sum_{\{r, \breve{r}\}\in\relationSet^\dagger}
     54 		\frac{\big|\big\{\,x\in\dataSet \mid c(x), g(x) \in \{r, \breve{r}\}\,\big\}\big|}
     55 		{\big|\big\{\,x\in\dataSet \mid c(x) \in \{r, \breve{r}\}\,\big\}\big|},
     56 \end{equation*}
     57 where \(\relationSet^\dagger\) is the set of relations paired by ignoring directionality.
     58 The set \(\relationSet^\dagger\) is well defined, since for the datasets using this metric, \(\relationSet\) is closed under the reverse operation \(\breve{\ }\) with the notable exception of \textsl{other}.
     59 However, similarly to micro-metrics, \textsl{other} is often ignored altogether.
     60 It only influences the final metrics through the degradation of recall on samples mispredicted as \textsl{other} and of precision on samples mispredicted as not \textsl{other}.
     61 Following the definitions above, we can similarly define \(\overUndirected{\operatorname{recall}}\) and \(\overUndirected{\fone}\).
     62 
     63 Finally, as a compromise between the directed \(\overDirected{\fone}\) and undirected \(\overUndirected{\fone}\), the half-directed metric was designed:
     64 \begin{equation*}
     65 	\overHalfdirected{\operatorname{precision}}(g, c) = 
     66 		\frac{1}{|\relationSet^\dagger|} \sum_{\{r, \breve{r}\}\in\relationSet^\dagger}
     67 		\frac{\big|\big\{\,x\in\dataSet \mid g(x) \in \{r, \breve{r}\} \land c(x) = g(x)\,\big\}\big|}
     68 		{\big|\big\{\,x\in\dataSet \mid c(x) \in \{r, \breve{r}\}\,\big\}\big|}.
     69 \end{equation*}
     70 The key difference with the undirected metric is that while the prediction and gold must still be equal to \(r\) or \(\breve{r}\), they furthermore need to be equal to each other.
     71 Figure~\ref{fig:relation extraction:supervised metrics} gives a visual explanation using the confusion matrix.
     72 Note that the distinction between directed and undirected metrics can also apply to micro-metrics.
     73 \begin{marginfigure}
     74 	\centering
     75 	\input{mainmatter/relation extraction/supervised metrics.tex}
     76 	\scaption[Supervised metrics defined on the confusion matrix.]{
     77 		Supervised metrics defined on the confusion matrix.
		Directed metrics consider green and blue to be different classes: the \(\overDirected{\operatorname{recall}}\) for the relation \(r\) is computed by dividing the number of samples in the dark green cell by the total number of samples in the green row.
		Undirected metrics consider green and blue to be the same class: the \(\overUndirected{\operatorname{recall}}\) for this class is computed by summing the four cells in the center, including the two hatched ones, and dividing by the sum of the two rows.
		Half-directed metrics also consider \(\{r,\breve{r}\}\) to form a single class, but the \(\overHalfdirected{\operatorname{recall}}\) is computed by summing the two dark cells in the center---ignoring the two hatched ones---and dividing by the sum of the two rows.
     81 		\label{fig:relation extraction:supervised metrics}
     82 	}
     83 \end{marginfigure}
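To make the three averaging schemes concrete, the following sketch computes the directed, undirected and half-directed macro-precisions from gold and predicted labels.
It assumes an illustrative naming convention where a relation name carries its direction as a suffix (e.g.\ \texttt{cause(e1,e2)} versus \texttt{cause(e2,e1)}) so that a relation can be paired with its reverse; this convention is ours, not part of any dataset.
\begin{verbatim}
def reverse(r):
    # Illustrative convention: "rel(e1,e2)" <-> "rel(e2,e1)".
    return (r.replace("(e1,e2)", "(tmp)")
             .replace("(e2,e1)", "(e1,e2)")
             .replace("(tmp)", "(e2,e1)"))

def macro_precisions(gold, pred, relations):
    """Directed, undirected and half-directed macro-precision."""
    directed, undirected, halfdirected, seen = [], [], [], set()
    for r in relations:
        # Directed: the prediction must match the gold label exactly.
        tp = sum(1 for g, c in zip(gold, pred) if c == g == r)
        pp = sum(1 for c in pred if c == r)
        directed.append(tp / pp if pp else 0.0)
        # Each undirected class {r, reverse(r)} is counted only once.
        pair = frozenset({r, reverse(r)})
        if pair in seen:
            continue
        seen.add(pair)
        pp = sum(1 for c in pred if c in pair)
        # Undirected: prediction and gold only need to fall in the pair.
        tp_u = sum(1 for g, c in zip(gold, pred) if c in pair and g in pair)
        # Half-directed: they must additionally be equal to each other.
        tp_h = sum(1 for g, c in zip(gold, pred) if g in pair and c == g)
        undirected.append(tp_u / pp if pp else 0.0)
        halfdirected.append(tp_h / pp if pp else 0.0)
    return (sum(directed) / len(directed),
            sum(undirected) / len(undirected),
            sum(halfdirected) / len(halfdirected))
\end{verbatim}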
     84 
     85 In conclusion, the evaluation of supervised approaches varies along three axes:
     86 \begin{itemize}
     87 	\item Whether \textsl{other} is considered a normal relation or is only taken into account through degraded precision and recall on the other classes.
     88 	\item Whether the directionality of relations is taken into account.
     89 	\item Whether class imbalance is corrected through macro-aggregation.
     90 \end{itemize}
     91 
     92 We now describe supervised relation extraction models, starting in this section with sentential approaches.
     93 
     94 \subsection{Regular Expressions: \textsc{dipre}}
     95 \label{sec:relation extraction:dipre}
     96 Dual Iterative Pattern Relation Expansion (\textsc{dipre}, \citex{dipre}) follows the bootstrap approaches (Section~\ref{sec:relation extraction:bootstrap}) and thus assumes \hypothesis{pullback}.
     97 Compared to \textcite{hearst_hyponyms}, \textsc{dipre} proposes a simple automation for the \(\relationSet_\entitySet\times \dataSet\to \relationSet_\sentenceSet\) step---the extraction of new patterns---and applies it to the extraction of the ``\textsl{author of book}'' relation.
To facilitate this automation and in contrast to \textcite{hearst_hyponyms}, it limits itself to two entities per pattern.
     99 \textsc{dipre} introduces the split-in-three-affixes technique illustrated by Figure~\ref{fig:relation extraction:dipre split}.
    100 \begin{marginfigure}
    101 	\centering
    102 	\input{mainmatter/relation extraction/dipre split.tex}
    103 	\scaption[\textsc{dipre} split-in-three-affixes method.]{
    104 		\textsc{dipre} split-in-three-affixes method.
		The algorithm ran on \textsc{html} code: \texttt{<li>} marks a list item, while \texttt{<b></b>} surrounds bold text.
    106 		\label{fig:relation extraction:dipre split}
    107 	}
    108 \end{marginfigure}
    109 The entities split the text into three parts: prefix before the first entity, infix between the two entities and suffix after the second entity.
This could be considered a split into five parts if we also count the two entities' surface forms, which are not part of any of the three affixes.
This split has since reappeared in other works, with the simplest methods assuming that the infix alone conveys the relation.
Even in the case of \textsc{dipre}, all three affixes are considered, but the infix needs to match exactly, while the prefix and suffix can be shortened in order to make a pattern more general.
All patterns are specific to a \textsc{url} prefix, which makes the algorithm quickly pick up on lists of books; patterns where the author appears before the title are handled with a simple boolean marker.
    114 
    115 In order to generate new patterns, \textsc{dipre} takes all occurrences with the same infix and with the title and author in the same order.
To avoid patterns which are too general, they use the following approximation of the specificity of a pattern:
    117 \begin{align*}
    118 	\operatorname{specificity}(\text{pattern}) & = -\log( P(\text{pattern matches})) \\
    119 	& \approx \text{total length of the affixes}.
    120 \end{align*}
When this specificity is lower than a given threshold divided by the number of known books it matched, the pattern is rejected.
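A minimal sketch of this acceptance test, where a pattern is represented by its three affixes and the set of known books it matched; the field names and the threshold value are illustrative, not those of \textcite{dipre}.
\begin{verbatim}
def specificity(pattern):
    # Approximated by the total length of the three affixes.
    return (len(pattern["prefix"]) + len(pattern["infix"])
            + len(pattern["suffix"]))

def accept(pattern, threshold=10.0):
    # Reject the pattern when its specificity is lower than the
    # threshold divided by the number of known books it matched.
    n_books = len(pattern["matched_books"])
    return specificity(pattern) >= threshold / n_books

pattern = {
    "prefix": "<li><b>", "infix": "</b> by ", "suffix": " (",
    "matched_books": {("Isaac Asimov", "The Robots of Dawn")},
}
print(accept(pattern))  # True with these illustrative values
\end{verbatim}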
    122 In the experiment, the algorithm was run on a starting set of five \((\text{author}, \text{title})\) facts which generated three patterns, one of which is given in Figure~\ref{fig:relation extraction:dipre split}; these patterns produced in turn 4\,047 facts.
    123 As per \textcite{hearst_hyponyms}, the algorithm was then iterated once again on these new facts.
    124 The second iteration introduced bogus facts, which were removed manually.
    125 Finally, the third iteration produced a total of 15\,257 \textsl{author of book} facts.
\Textcite{dipre} manually analyzes twenty books out of these 15\,257 and finds that only one of them is not a book but an article, while four of them are obscure enough not to appear in the list of a major bookseller.
    127 
    128 A limitation of the bootstrap approaches assuming \hypothesis{pullback} is that this assumption naively entails the following:
    129 \begin{marginparagraph}
    130 	As a reminder from Section~\ref{sec:context:relation algebra}: \(\relationZero\) denotes the empty relation linking no entities together.
	So \(r_1\relationAnd r_2 = \relationZero\) should be understood as ``if we take the relation linking together all the entity pairs connected at the same time (\(\relationAnd\)) by \(r_1\) and \(r_2\), we should obtain the relation linking no entities together (\(\relationZero\)).''
    132 \end{marginparagraph}
    133 \begin{assumption}[oneadjacency]{1-adjacency}
    134 	There is no more than one relation linking any two entities.
    135 
    136 	\smallskip
    137 	\noindent
    138 	\( \forall r_1, r_2\in\relationSet\colon r_1\relationAnd r_2 = \relationZero \)
    139 \end{assumption}
Indeed, if a pair of entities is linked by two relations, this would imply that a sentence containing these two entities also conveys the two relations.
By induction, it follows that the two relations would actually be the same.
    142 
    143 The approach of \textsc{dipre} was subsequently used by other systems such as Snowball \parencite{snowball}, which uses more complex matching and pattern generation algorithms and formalizes the experimental setup.
    144 We now focus on another semi-supervised approach similar to bootstrap, which was important to the development of relation extraction methods.
    145 
    146 \subsection{Dependency Trees: \textsc{dirt}}
    147 \label{sec:relation extraction:dirt}
    148 Discovery of Inference Rules from Text (\textsc{dirt}, \citex{dirt}) also uses the \hypothesis{pullback} assumption but makes a single iteration of the bootstrap algorithm from a single example.
    149 Furthermore, \textsc{dirt} makes the pattern building \(\relationSet_\entitySet\times \dataSet\to \relationSet_\sentenceSet\) more resilient to noise and applies the algorithm to multiple relations.
    150 Another difference is that it factorizes the definition of \(\relationSet_\sentenceSet\) using dependency paths instead of regular expressions.
    151 Given a sentence, a dependency parser can create a tree where nodes are built from words, and the arcs between the nodes correspond to the grammatical relationship between the words.
    152 This is called a dependency tree and is exemplified by Figure~\ref{fig:relation extraction:dependency tree}.
    153 \begin{marginfigure}
    154 	\centering
    155 	\input{mainmatter/relation extraction/dependency tree.tex}
    156 	\scaption[Example of dependency tree.]{
    157 		Example of dependency tree given by \textcite{dirt} generated using the Minipar dependency parser.
    158 		The nodes correspond to words in the sentence, as indicated by the dashed line.
    159 		Each node is tagged by the part-of-speech (\textsc{pos}) of the associated word.
    160 		The arrows between the nodes are labeled with the dependency between the words.
    161 		The following abbreviations are used: \texttt{N} is noun, \texttt{V} is verb, \texttt{Det} is determiner, \texttt{subj} is subject, \texttt{obj} is object, and \texttt{det} is the determiner relation.
    162 		\label{fig:relation extraction:dependency tree}
    163 	}
    164 \end{marginfigure}%
    165 \begin{epigraph}
    166 		{Groucho Marx}
    167 		{Animal Crackers}
    168 		{1930}[The ambiguity of the prepositional phrase ``in my pajamas'' would be removed by a dependency tree. It can either be linked to the noun ``elephant'' or to the verb ``shot.'']
    169 	While hunting in Africa, I shot an elephant in my pajamas. How he got into my pajamas, I don't know.
    170 \end{epigraph}
    171 After building a dependency tree, we can take the path between two nodes in the tree, for example the path between ``John'' and ``problem'' in the tree of Figure~\ref{fig:relation extraction:dependency tree} is:
    172 \begin{indentedexample}
    173 	\(\leftarrow\)\texttt{N:subj:V}\(\leftarrow\)%
    174 find%
    175 	\(\rightarrow\)\texttt{V:obj:N}\(\rightarrow\)%
    176 solution%
    177 	\(\rightarrow\)\texttt{N:to:N}\(\rightarrow\)%
    178 \end{indentedexample}
    179 Note that lemmatization is performed on the nodes.
    180 \Textcite{dirt} state their assumption as an extension of the distributional hypothesis (see section~\ref{sec:context:history}):
    181 \begin{spacedblock}
    182 	\strong{Distributional Hypothesis on Dependency Paths:}
    183 	\emph{If two dependency paths occur in similar contexts, they tend to convey similar meanings.}
    184 \end{spacedblock}
    185 In the case of \textsc{dirt}, context is defined as the two endpoints of the paths.
    186 For example, the context of the path given above in Figure~\ref{fig:relation extraction:dependency tree} consists of the words ``John'' and ``problem.''
    187 As such, this can be seen as a probabilistic version of the \(\relationSet_\entitySet\times\dataSet\to\relationSet_\sentenceSet\) step.
    188 In order to ensure these paths correspond to meaningful relations, only paths between nouns are considered.
For example, by counting all entities appearing at the endpoints of the path above, \textcite{dirt} observe that the following path has similar endpoints:
    190 \begin{indentedexample}
    191 	\(\leftarrow\)\texttt{N:subj:V}\(\leftarrow\)%
    192 solve%
    193 	\(\rightarrow\)\texttt{V:obj:N}\(\rightarrow\)%
    194 \end{indentedexample}
    195 Therefore, they can conclude that these two paths correspond to the same relation.
    196 The orientation of a path is not essential.
    197 If the subject of ``solve'' appears after its object in a sentence, we still want this path to be counted the same as the one above.
    198 As introduced in Section~\ref{sec:relation extraction:directionality}, this is a common problem in relation extraction.
    199 To solve this in a relatively straightforward manner, we simply assume all paths come in the two possible orientations, so for each sentence, the extracted path and its reverse are added to the dataset.
We use a mutual information-based measure to evaluate how similar two sets of endpoints are.
Since counting all possible pairs would be too memory intensive---the squared size of the vocabulary \(|V|^2\) is usually on the order of a billion or more---we measure the similarity of the first and second endpoints separately.
To measure the preference of the dependency path \(\pi\) to have the word \(w\in V\) appear at the endpoint \(\ell\in\{\leftarrow, \rightarrow\}\), the following conditional pointwise mutual information is used:%
    203 \begin{marginparagraph}
    204 	The similarity metric equations in \textcite{dirt} are quite informal.
    205 	In particular, they do not state that \(\ell\) has a special role as a conditional variable in the \(\pmi\) and erroneously designate the same value as \(\operatorname{mi}(\pi, m, \ell)\).
    206 	The equations given here are our own.
    207 \end{marginparagraph}
    208 \begin{align*}
    209 	\pmi(\pi, w\mid \ell) & = \log \frac{P(\pi, w \mid \ell)}{P(\pi\mid \ell)P(w\mid \ell)} \\
    210 	& = \log \frac{P(\pi, \ell, w)P(\ell)}{P(\pi,\ell)P(\ell,w)}.
    211 \end{align*}
This quantity can be computed empirically using a hash table counting how many times the triplet \((\pi, \ell, w)\) appeared in the dataset.
We can then compute the similarity between two paths given an endpoint \(\ell\), and take the geometric mean over the two possible values of \(\ell\) to obtain an unconditioned similarity between paths:
    214 \begin{equation*}
    215 	\operatorname{sim}(\pi_1, \pi_2, \ell) = \frac
    216 		{\sum_{w\in C(\pi_1, \ell)\cap C(\pi_2, \ell)} \big( \pmi(\pi_1, w\mid \ell) + \pmi(\pi_2, w\mid \ell) \big)}
    217 		{\sum_{w\in C(\pi_1, \ell)} \pmi(\pi_1, w\mid \ell) + \sum_{w\in C(\pi_2, \ell)} \pmi(\pi_2, w\mid \ell)}
    218 \end{equation*}
    219 \begin{equation*}
    220 	\operatorname{sim}(\pi_1, \pi_2) =
    221 		\sqrt{\operatorname{sim}(\pi_1, \pi_2, \leftarrow)
    222 			\times \operatorname{sim}(\pi_1, \pi_2, \rightarrow)},
    223 \end{equation*}
    224 where \(C(\pi, \ell)\) designates the context, that is the set of words appearing at the endpoint \(\ell\) of the path \(\pi\).
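As a rough illustration---our own sketch, written for clarity rather than efficiency and not the original implementation---the similarity above can be computed from a table of \((\pi, \ell, w)\) counts as follows:
\begin{verbatim}
import math
from collections import Counter

# counts[(path, slot, word)]: occurrences in the corpus; slot is
# "left" or "right", standing for the two endpoints of the path.
counts = Counter()

def pmi(path, word, slot):
    # pmi(pi, w | l) = log P(pi, l, w) P(l) / (P(pi, l) P(l, w))
    n_l = sum(c for (p, s, w), c in counts.items() if s == slot)
    n_pl = sum(c for (p, s, w), c in counts.items()
               if p == path and s == slot)
    n_lw = sum(c for (p, s, w), c in counts.items()
               if s == slot and w == word)
    return math.log(counts[(path, slot, word)] * n_l / (n_pl * n_lw))

def context(path, slot):
    return {w for (p, s, w), c in counts.items()
            if p == path and s == slot and c > 0}

def sim_slot(p1, p2, slot):
    shared = context(p1, slot) & context(p2, slot)
    num = sum(pmi(p1, w, slot) + pmi(p2, w, slot) for w in shared)
    den = (sum(pmi(p1, w, slot) for w in context(p1, slot))
           + sum(pmi(p2, w, slot) for w in context(p2, slot)))
    return num / den if den else 0.0

def sim(p1, p2):
    # Geometric mean of the two endpoint-wise similarities.
    return math.sqrt(sim_slot(p1, p2, "left") * sim_slot(p1, p2, "right"))
\end{verbatim}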
    225 
    226 Using this similarity function, \textcite{dirt} can find sets of paths corresponding to particular relations by looking at frequent paths above a fixed similarity threshold.
    227 They evaluate their method manually on a question answering dataset.
    228 For each question, they extract the corresponding path and then look at the 40 most similar paths in their dataset and manually tag whether these paths would answer the original question.
    229 The accuracy of \textsc{dirt} ranges from \(92.5\%\) for the relation ``\textsl{manufactures}'' to \(0\%\) for the relation ``\textsl{monetary value of}'' for which no similar paths were found.
    230 
    231 \subsection{Hand-designed Feature Extractors}
    232 \label{sec:relation extraction:hand-designed features}
    233 The first supervised systems for relation extraction were designed for the template relations (\textsc{tr}) task of the seventh message understanding conference (\textsc{muc-7}).
    234 \begin{marginparagraph}
	To put these results into perspective with later work, note that \textcite{ie2} mention they ran their model on a 167\,\textsc{mh}z processor with 128\,\textsc{mb} of \textsc{ram}.
    236 \end{marginparagraph}
    237 The best result was obtained by the \(\textsc{ie}^2\) system \parencite{ie2}, which relied on manual pattern development, with an \fone{} score of 76\%.
    238 A close second was the 71\% \fone{} score of the \textsc{sift} system \parencitex{sift}, which was devoid of hand-written patterns.
    239 \textsc{sift} builds an augmented parse tree of the sentence, where nodes are added to encode the semantic information conveyed by each constituent.
New nodes are created using an algorithm akin to a probabilistic context-free grammar estimated by maximum likelihood.
    241 The semantic annotations are chosen following co-occurrence counts in the training set, using dynamic programming to search the space of augmented parse trees efficiently.
    242 \textsc{sift} also uses a model to find cross-sentence relations, which represent 10--20\% of the test set.
    243 The predictions are made from a set of elemental features, one of which was whether the candidate fact was seen in a previous sample; this gives a slight aggregate orientation to \textsc{sift}, even though it is primarily a sentential approach (Section~\ref{sec:relation extraction:definition}).
    244 This first systematic evaluation of models on the same dataset set the stage for the development of the relation extraction task.
    245 
    246 Subsequently, several methods built upon carefully designed features.
    247 This is for example the case of \textcitex{maximum_entropy_re} who use the maximum entropy principle on the following set of features:
    248 \begin{marginfigure}
    249 	\centering
    250 	\input{mainmatter/relation extraction/syntactic parse tree.tex}
    251 	\scaption[Example of syntactic parse tree.]{
    252 		Example of syntactic parse tree generated by the \textsc{pcfg} parser \parencite{pcfg}.
    253 		The following abbreviations are used:
    254 			\texttt{S} (simple declarative clause),
    255 			\texttt{NP} (noun phrase),
    256 			\texttt{VP} (verb phrase),
    257 			\texttt{ADJP} (adjective phrase),
    258 			\texttt{NNS} (plural noun),
    259 			\texttt{NNP} (singular proper noun),
    260 			\texttt{RB} (adverb),
    261 			\texttt{JJ} (adjective).
		In contrast to a dependency tree (Figure~\ref{fig:relation extraction:dependency tree}), the words correspond to the tree's leaves, while internal nodes correspond to constituent clauses.
    263 		\label{fig:relation extraction:syntactic parse tree}
    264 	}
    265 \end{marginfigure}
    266 \begin{itemize}
    267 	\item entities and infix words with positional markers,
    268 	\item entity types by applying \textsc{ner} to the corpus,
    269 	\item entity levels, that is whether the entity is a composite noun or a pronoun which was linked to an entity through coreference resolution,
    270 	\item the number of other words and entities appearing between \(e_1\) and \(e_2\),
    271 	\item whether \(e_1\) and \(e_2\) are in the same noun phrase, verb phrase or prepositional phrase,
    272 	\item the dependency neighborhood, that is the neighboring nodes in the dependency tree (see Figure~\ref{fig:relation extraction:dependency tree}),
    273 	\item the syntactic path, that is the path between the entities in the syntactic parse tree (see Figure~\ref{fig:relation extraction:syntactic parse tree}).
    274 \end{itemize}
Let's call \((f_i(x, r))_{i\in\{1,\dotsc,n\}}\) the indicator functions which equal 1 iff \(x\) has feature \(i\) and conveys \(r\).
    276 The maximum entropy principle states that a classifier should match empirical data on the observed space but should have maximal entropy outside it.
    277 Calling \(Q^*\) the optimal probability model in this sense, we have:
    278 \begin{align*}
    279 	Q^* & = \argmax_{Q\in \symcal{Q}} \entropy(Q) \\
    280 	    & = \argmax_{Q\in \symcal{Q}} \sum_{(x,r)\in\dataSet} - Q(x, r) \log Q(r\mid x) \\
    281 	    & = \argmax_{Q\in \symcal{Q}} \sum_{(x,r)\in\dataSet} - \empP(x) Q(r\mid x) \log Q(r\mid x),
    282 \end{align*}
    283 \begin{marginparagraph}[-12mm]
    284 	As a reminder, \(\empP\) denotes the empirical distribution.
    285 \end{marginparagraph}
    286 where \(\symcal{Q}\) is the set of probability mass functions matching observations:
    287 \begin{equation*}
    288 	\symcal{Q} = \left\{\, \text{p.m.f.\ } Q \,\middle|\, \expectation_{(x, r)\sim Q}[f_i(x, r)] = \expectation_{(x, r)\sim \empP}[f_i(x, r)] \,\right\}.
    289 \end{equation*}
    290 Given this setup, the solution is part of a very restricted class of functions:
    291 \begin{equation*}
    292 	Q^*(r\mid x; \vctr{\lambda}) \propto \displaystyle\exp \sum_{i=1}^n \lambda_i f_i(x, r).
    293 \end{equation*}
    294 The parameters \(\vctr{\lambda}\) are estimated using an algorithm called generalized iterative scaling (\textsc{gis}, \cite{gis}).
    295 Using this approach, \textcite{maximum_entropy_re} evaluate their model on a dataset succeeding \textsc{muc-7} called \textsc{ace} (to be precise, \textsc{ace}~2003, see Section~\ref{sec:datasets:ace} for details).
    296 They achieve an \fone{} of 52.8\% on 24 \textsc{ace} relation subtypes.
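For concreteness, here is a minimal sketch of the resulting exponential-form classifier; the feature functions and weights are placeholders, and in practice \(\vctr{\lambda}\) would be fitted with \textsc{gis} or any other convex optimizer rather than set by hand.
\begin{verbatim}
import math

def maxent_probs(x, relations, features, lam):
    """P(r | x) proportional to exp(sum_i lambda_i * f_i(x, r))."""
    scores = {r: math.exp(sum(l * f(x, r) for l, f in zip(lam, features)))
              for r in relations}
    z = sum(scores.values())
    return {r: s / z for r, s in scores.items()}

# Illustrative binary feature functions f_i(x, r).
features = [
    lambda x, r: float("born" in x["infix"] and r == "place_of_birth"),
    lambda x, r: float(x["e1_type"] == "PER" and r == "place_of_birth"),
    lambda x, r: float(len(x["infix"]) > 10 and r == "other"),
]
lam = [1.5, 0.7, 0.3]  # weights, normally learned from data

x = {"infix": ["was", "born", "in"], "e1_type": "PER"}
print(maxent_probs(x, ["place_of_birth", "other"], features, lam))
\end{verbatim}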
    297 
    298 \subsection{Kernel Approaches}
    299 \label{sec:relation extraction:kernel}
Designing a set of low-dimensional features is a tedious task: a large set of features can be computationally prohibitive, while a small set of features is necessarily limiting since it can never completely capture the essence of all samples, which live in a higher-dimensional space.
    301 The kernel approaches seek to avoid this limitation by comparing samples pairwise without passing through an explicit intermediary representation.
To do so, a kernel function \(k\) is defined over pairs of samples:
    303 \begin{equation*}
    304 	k\colon (\sentenceSet\times\entitySet^2)\times(\sentenceSet\times\entitySet^2)\to\symbb{R}_{\geq 0},
    305 \end{equation*}
    306 where \(k\) acts as a similarity measure and is required to be symmetric and positive-semidefinite.
It can be shown that there is an equivalence between kernel functions and feature spaces; for each kernel function \(k\) there is an implicit set of features \(\vctr{f}\) such that \(k(x_1, x_2) = \vctr{f}(x_1)\cdot\vctr{f}(x_2)\).
However, some kernel functions \(k\) can be computed without having to enumerate all features \(\vctr{f}\).
    309 
    310 This property is used for relation extraction by \textcitex{kernel_parse} who define a similarity function \(k\) between shallow parse trees.%
    311 \sidenote{A shallow parse tree is similar to a syntactic parse tree (Figure~\ref{fig:relation extraction:syntactic parse tree}) on a partition of the words of a sentence \parencite{shallow_parse_tree}.}
    312 The tree kernel is defined through a similarity on nodes with a recursive call on children nodes.
The equivalent feature space would need to contain all possible sub-trees, which are impractical to enumerate.
    314 \Textcite{kernel_parse} train a support vector machine (\textsc{svm}, \cite{svm}) and a voted perceptron \parencite{voted_perceptron} on a dataset they hand-labeled.
    315 \Textcitex{kernel_dependency} used a similar approach with a tree kernel, except that they used dependency trees (Figure~\ref{fig:relation extraction:dependency tree}) instead of syntactic parse trees.
    316 They trained \textsc{svm}s on the \textsc{ace}~2004 dataset (Section~\ref{sec:datasets:ace}), with their best setup reaching an \fone{} of 63.2\%.
    317 Finally, \textcitex{kernel_exploring} also trained an \textsc{svm} but directly used the dot product inside the feature space as a kernel.%
\sidenote{In the same way that a kernel always corresponds to the dot product in a feature space, the reverse can be shown to be true too, since a Gram matrix is always positive semidefinite.}
    319 Extracting a wide variety of features, they were able to reach an \fone{} score of 74.7\% on the \textsc{ace}~2004 dataset.
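As a toy illustration of such a recursive tree kernel---a simplification of ours, not the exact kernel of \textcite{kernel_parse} or \textcite{kernel_dependency}---nodes can be compared by their labels, with the kernel summed recursively over aligned children:
\begin{verbatim}
def tree_kernel(t1, t2):
    # A node is a (label, children) pair; children is a list of nodes.
    label1, children1 = t1
    label2, children2 = t2
    if label1 != label2:
        return 0.0
    # A matching node contributes 1, plus the kernel values of its
    # position-wise aligned children; real kernels match subsequences.
    return 1.0 + sum(tree_kernel(c1, c2)
                     for c1, c2 in zip(children1, children2))

np_ = ("NP", [("Det", []), ("N", [])])
vp = ("VP", [("V", []), ("NP", [("N", [])])])
print(tree_kernel(("S", [np_, vp]), ("S", [np_, vp])))  # 8.0
\end{verbatim}
The implicit feature space of such a kernel contains counts of matching sub-tree configurations, which would be impractical to enumerate explicitly.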
    320 
    321 \subsection{Piecewise Convolutional Neural Network}
    322 \label{sec:relation extraction:pcnn}
    323 In the 2010s, machine learning models moved away from hand-designed features towards automatic feature extractors (Section~\ref{sec:context:history}).
    324 In relation extraction, this move was initiated by \textcite{rmvs} using an \textsc{rnn}-like model (Section~\ref{sec:context:rnn}), but it really started to gain traction with piecewise convolutional neural networks (\textsc{pcnn}, \citex{pcnn}).
    325 \textsc{pcnn}s perform supervised relation extraction using deep learning.
    326 In contrast to previous models, they learn a \textsc{cnn} feature extractor (Section~\ref{sec:context:cnn}) on top of word2vec embeddings (Section~\ref{sec:context:word2vec}) instead of using hand-engineered features.
    327 Furthermore, \textsc{pcnn} uses the split-in-three-affixes method of \textsc{dipre} (Figure~\ref{fig:relation extraction:dipre split}).
They feed each affix to a \textsc{cnn} followed by a max-pooling layer to obtain a fixed-length representation of the sentence that depends on the position of words relative to the entities.
This representation is then used to predict the relation using a linear layer followed by a softmax.
While the global position invariance of \textsc{cnn}s is interesting for language modeling, phrases closer to entities might be of more importance for relation extraction; thus, \textsc{pcnn} also uses temporal encoding (Section~\ref{sec:context:attention lm}).
    331 Figure~\ref{fig:relation extraction:pcnn} showcases a \textsc{pcnn} model.
    332 
    333 \begin{figure*}[ht!]
    334 	\centering
    335 	\input{mainmatter/relation extraction/pcnn.tex}
    336 	\scaption[Architecture of a \textsc{pcnn} model.]{
    337 		Architecture of a \textsc{pcnn} model.
    338 		The model is only given a sentence that was split into three pieces; entities are ignored.
    339 		The embeddings of the words in each piece are concatenated with two positional embeddings.
    340 		Each piece is then fed to a convolutional layer, and a linear layer merges the three representations together.
    341 		At the softmax output, we obtain a probability distribution over possible relations given the sentence.
    342 		\label{fig:relation extraction:pcnn}
    343 	}[-2cm]
    344 \end{figure*}
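To make the piecewise architecture concrete, here is a minimal PyTorch sketch of the forward pass; it simplifies the model of Figure~\ref{fig:relation extraction:pcnn} by omitting the positional embeddings, and all hyperparameters are illustrative.
\begin{verbatim}
import torch
import torch.nn as nn

class TinyPCNN(nn.Module):
    def __init__(self, vocab_size, n_relations, dim=50, channels=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.conv = nn.Conv1d(dim, channels, kernel_size=3, padding=1)
        self.out = nn.Linear(3 * channels, n_relations)

    def forward(self, prefix, infix, suffix):
        # Each piece is a (batch, length) tensor of token ids.
        pooled = []
        for piece in (prefix, infix, suffix):
            h = self.embed(piece).transpose(1, 2)  # (batch, dim, length)
            h = torch.relu(self.conv(h))           # (batch, channels, length)
            pooled.append(h.max(dim=2).values)     # max-pool over time
        # A linear layer merges the three piece representations.
        return torch.softmax(self.out(torch.cat(pooled, dim=1)), dim=1)

model = TinyPCNN(vocab_size=1000, n_relations=5)
probs = model(torch.randint(1000, (2, 4)),
              torch.randint(1000, (2, 6)),
              torch.randint(1000, (2, 3)))
print(probs.shape)  # torch.Size([2, 5])
\end{verbatim}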
    345 
    346 The setup described above can be used for sentential relation extraction.
    347 However, \textcite{pcnn} and subsequent works place themselves in the aggregate setup.
    348 Therefore, we will wait until Section~\ref{sec:relation extraction:pcnn aggregate} to delve into the training algorithm and experimental results of \textsc{pcnn}s.
    349 
    350 \subsection{Transformer-based Models}
    351 \label{sec:relation extraction:mtb sentential}
    352 Following the progression of Section~\ref{sec:context:sentence}, \textsc{cnn}-based models were soon replaced by transformer-based models.
    353 \Textcitex{mtb} introduce the unsupervised matching the blanks (\textsc{mtb}) model together with an in-depth study on the use of transformers for relation extraction.
    354 We will focus on the transformer extractor in this section and study the unsupervised model in Section~\ref{sec:relation extraction:mtb}.
\Textcite{mtb} introduce several methods to extract an entity-aware representation of a sentence using \textsc{bert} (Section~\ref{sec:context:transformers}).
    356 These different methods can be characterized along two axes:
    357 \begin{description}
	\item[Entity Span Identification,] that is, how the entities are marked in the sentence.
    359 		This can be \emph{none}, meaning that the entities are not differentiated from the other words in the sentence.
    360 		It can be through \emph{entity markers}, i.e.~new tokens are introduced to mark the two entities' beginning and end, as showcased by Figures~\ref{fig:relation extraction:epgnn sentence representation} and~\ref{fig:relation extraction:emes}.
    361 		Finally, it can be through a special feature of \textsc{bert}: \emph{token type embeddings}; in this case, the embeddings of the entity tokens are added to another embedding representing the slot---either \(e_1\) or \(e_2\)---of the entity.
	\item[Output Construction,] that is, how a fixed-size representation is obtained from the sequence of token embeddings.
		A first approach is to simply use the \textsc{cls} \emph{token} embedding, i.e.~the sequence's first token, which should encompass the whole sentence's semantics (Section~\ref{sec:context:transformers}).
		A second approach is to use \emph{entity max-pooling}: each entity is represented by the component-wise maximum of its token embeddings, and the sentence is represented by the concatenation of the two entities' representations.
		A variant of this, using mean pooling combined with the \textsc{cls} method, is used by \textsc{epgnn} (Figure~\ref{fig:relation extraction:epgnn sentence representation}).
		These representations should better capture the semantics surrounding the entities, in contrast to the \textsc{cls} token, which captures the whole sentence's semantics.
		Finally, a last option is to use the embeddings of the \emph{entity start markers}; this is the option illustrated by Figure~\ref{fig:relation extraction:emes} and has the advantage of lessening the dependence of the representation on the entity surface form (Section~\ref{sec:relation extraction:entity} describes why this could be desirable).
    368 \end{description}
    369 
    370 \begin{figure}[th]
    371 	\centering
    372 	\input{mainmatter/relation extraction/emes.tex}%
    373 	\scaption*[\textsc{mtb} entity markers--entity start sentence representation.]{
    374 		\textsc{mtb} entity markers--entity start sentence representation.
    375 		``Bentham'' was split into two subword tokens, ``Ben-'' and ``-tham'' by the \textsc{bpe} algorithm described in Section~\ref{sec:context:bpe}.
    376 		The contextualized embeddings of most words are ignored.
    377 		The final representation is only built using the representation of \texttt{<e1>} and \texttt{<e2>}.
    378 		However, note that these representations are built from all the words in the sentence using an attention mechanism (Section~\ref{sec:context:attention}).
    379 		In the original work of \textcite{mtb}, the representation extracted by \textsc{bert} is either fed through layer normalization \parencite{layernorm} or to a linear layer depending on the dataset.
    380 		\label{fig:relation extraction:emes}
    381 	}
    382 \end{figure}
    383 
    384 The best results obtained by \textsc{mtb} were with the entity markers--entity start method.
    385 This is the method we focus on from now on.
We refer to this sentence representation model by the function \(\bertcoder\colon\sentenceSet\to\symbb{R}^d\) illustrated in Figure~\ref{fig:relation extraction:emes}.
    387 Training is performed using a softmax layer of size \(|\relationSet|\) with a cross-entropy loss.
    388 Using a standard \bertArch{large} pre-trained on a \textsc{mlm} task, \textsc{mtb} obtains a macro-\(\overHalfdirected{\fone}\) of 89.2\% on the SemEval 2010 Task 8 (Section~\ref{sec:datasets:semeval}).
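A rough sketch of this entity markers--entity start extractor using the HuggingFace \texttt{transformers} library is given below; the marker token names and the illustrative sentence are our own choices, not necessarily those of \textcite{mtb}.
\begin{verbatim}
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
# Add the entity marker tokens so BERT can learn embeddings for them.
markers = ["<e1>", "</e1>", "<e2>", "</e2>"]
tokenizer.add_special_tokens({"additional_special_tokens": markers})
bert = AutoModel.from_pretrained("bert-large-uncased")
bert.resize_token_embeddings(len(tokenizer))

sentence = "The <e1> panopticon </e1> was devised by <e2> Bentham </e2> ."
inputs = tokenizer(sentence, return_tensors="pt")
hidden = bert(**inputs).last_hidden_state.squeeze(0)  # (length, 1024)

# Keep only the contextualized embeddings of the two start markers.
ids = inputs["input_ids"].squeeze(0).tolist()
e1 = hidden[ids.index(tokenizer.convert_tokens_to_ids("<e1>"))]
e2 = hidden[ids.index(tokenizer.convert_tokens_to_ids("<e2>"))]
representation = torch.cat([e1, e2])  # fed to the relation classifier
print(representation.shape)  # torch.Size([2048])
\end{verbatim}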